Abstract

Representation learning systems typically rely on massive amounts of labeled data 
in order to be trained to high accuracy. Recently, high-dimensional parametric 
models like neural networks have succeeded in building rich representations using 
either compressive, reconstructive or supervised criteria. However, the semantic 
structure inherent in observations is oftentimes lost in the process. Human perception
excels at understanding semantics but cannot always be expressed in terms of
labels. Thus, oracles or human-in-the-loop systems, for example crowdsourcing,
are often employed to generate similarity constraints using an implicit similarity
function encoded in human perception. In this work we propose to combine
generative unsupervised feature learning with a probabilistic treatment of oracle
information like triplets in order to transfer implicit privileged oracle knowledge
into explicit nonlinear Bayesian latent factor models of the observations. We use
a fast variational algorithm to learn the joint model and demonstrate applicability
to a well-known image dataset. We show how implicit triplet information can
provide rich information to learn representations that outperform previous metric 
learning approaches as well as generative models without this side-information in 
a variety of predictive tasks. In addition, we illustrate that the proposed approach 
compartmentalizes the latent spaces semantically which allows interpretation of 
the latent variables. 


1 Introduction 


Machine Learning excels in its ability to model large quantities of data with layered non-linear 
feature-learning systems for purposes such as classification and understanding of images, scenes, 
videos, text and more structured objects. Commonly, many successes are owed to the extensive
availability of labels coupled with supervised learning. In other successful cases, the structure of 
the data is being used as a means to hard-code wiring for models, for instance modeling video 
using slowness and convolutions in images. Oftentimes, especially in the case of perception, a) the 
real structure of the data generating process is unknown and hard to explicitly model well, or b) 
large amounts of accurate labels are hard to come by or may even be inadequate for knowledge 
representation. One way to incorporate further information is to query oracles like crowds to gather
cheap labels or to collect auxiliary information like similarity constraints according to undefined
perceptual biases the crowd may be aware of. While labeling may be noisy or inadequate to represent 
knowledge, similarity constraints present a robust way to encode implicit information about various 
properties of stimuli. 


We propose to take advantage of auxiliary (implicit) information provided by one or more oracles 
as a means to learn flexible graphical models with latent variables. Examples of oracles include 
(human) crowds or implicit structural knowledge about the data, such as structural or multi-modal 
constraints without access to explicit features, which are encoded as triplet constraints (see
Section 2.1). Critically, we consider the oracle similarity constraints as implicit observations generated
through an unknown process which we include in our model in order to capture subtle knowledge
about similarity from the oracle(s). This key idea helps shape explicit, interpretable latent spaces that
exceed the performance of purely unsupervised learning and can be applied in cases where labels are 
sparse or undesirable. These latent spaces can also be used to explicitly inspect the implicit
knowledge passed on by the oracle to the model. Our goal is to infer a latent factor model which learns
jointly from triplets and observed data and transfers implicit biases encoded in the triplets into an
explicit latent space that captures the semantics of the triplet-generating process better than simple
density estimation (see Figure 1). We provide a detailed review of related work in Section 3, where
we explain the relationship of our model to other triplet-loss based metric learning
approaches and generative models.


We first describe the two key contributions needed to perform the described task. In Section 2.1
we introduce a novel probabilistic generative model of oracle observations. We extend this with
a principled approach for multi-query oracles using masked subspaces in Section 2.2. The second
key contribution is described in Section 2.4, where we propose a principled approach combining the
probabilistic oracle model with a graphical model performing nonlinear feature learning in order to
transfer the implicit triplet knowledge into an explicit parametric model. In Section 2.5 we introduce
a fast variational inference algorithm to learn posterior latent spaces respecting the observations of
data and constraints. Finally, in Section 4 we present experimental results for benchmarking
the proposed approaches, illustrating their properties and discussing the benefits they confer over
competing approaches.


2 Methods 


Let $X \in \mathbb{R}^{N \times D}$ denote $N$ observations with $D$ dimensions. We define latent variables
$z \in \mathbb{R}^{N \times H}$ corresponding to $H$-dimensional latent representations of datapoints $x$.

2.1 Probabilistic Modeling Of Oracle Triplets 


We consider an unknown (dis)similarity function $S_Q(\Phi(x_i), \Phi(x_j))$ that computes the distance
between two objects $x_i$ and $x_j$ with respect to a query $Q$ based on semantic information associated
with these two objects. We consider $z := \Phi(x)$ to be the internal conceptual representation the
oracle uses to apply the similarity function $S_Q$ for an unobserved feature space $\Phi$. In addition,
we consider the case where we cannot directly observe the similarity function, but where we only
observe orderings over similarities of $z_i$ to $z_j$ and $z_l$, i.e., either $S_Q(z_i, z_j)$ is greater (equal) or
smaller than $S_Q(z_i, z_l)$. We define the set of all oracle triplets related to query $Q$ as:

$$\mathcal{T}_Q = \{(i, j, l) \mid S_Q(z_i, z_j) > S_Q(z_i, z_l)\}. \tag{1}$$

We do not have access to the exhaustive set $\mathcal{T}_Q$, but can sample $K$ times from it using the oracle to
yield a finite sample $\mathcal{T}_Q^{K} = \{t_k\}_{k=1}^{K}$.

An illustrative example of this process is human perceptual judgement of similarities, which heavily
relies on internal representations and abstracted concepts to evaluate similarities rather than purely
using raw low-level image statistics. A frequently used oracle is the crowd. Systems like Amazon
Mechanical Turk are used to obtain triplet samples to explore the human perceptual prior as an oracle.
However, oracles also naturally arise from data structure, such as temporal or spatial orderings or
other known semantic structure. Another type of oracle is access to privileged information, that is,
extra features only implicitly available through the triplets, such as sentiments associated with
visual features. A shared property of all these oracles is that they provide weak natural constraints
on similarity without explicitly quantifying it.

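We model the likelihood of a triplet $t_{i,j,l}$ being contained in $\mathcal{T}_Q$ as a draw from a Bernoulli
distribution over the states True and False parametrized using a softmax function. If we consider

$$p(t_{i,j,l}) = \int p(t_{i,j,l} \mid z_i, z_j, z_l)\, p(z_i)\, p(z_j)\, p(z_l)\, dz_i\, dz_j\, dz_l, \tag{2}$$

this gives the following likelihood:

$$p(t_{i,j,l}) = \mathrm{Ber}\left(t_{i,j,l} \,\Big|\, \frac{e^{-D_{i,j}}}{e^{-D_{i,j}} + e^{-D_{i,l}}}\right) \tag{3}$$

with

$$D_{a,b} = \sum_{h=1}^{H} \mathrm{JS}\big(p(z_a^h) \,\|\, p(z_b^h)\big) \tag{4}$$

and $\mathrm{JS}(p(z_a^h) \,\|\, p(z_b^h)) = \tfrac{1}{2}\mathrm{KL}(p(z_a^h) \,\|\, p_m(z^h)) + \tfrac{1}{2}\mathrm{KL}(p(z_b^h) \,\|\, p_m(z^h))$ with
$p_m(z^h) = \tfrac{1}{2}p(z_a^h) + \tfrac{1}{2}p(z_b^h)$ and $\mathrm{KL}(q \,\|\, p) = \int q(z) \log \frac{q(z)}{p(z)}\, dz$.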

Z 

We denote the likelihood in (3) as BER in the rest of the paper. Since the Jensen-Shannon divergence
above is commonly intractable, we discuss alternatives used in our experiments in Supplement B.
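The softmax-parametrized Bernoulli likelihood (BER) can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: each latent code is a diagonal Gaussian given as (mean, variance) pairs, the softmax compares negative distances between the anchor and the two candidates, and a symmetrized KL divergence stands in for the intractable Jensen-Shannon divergence (the alternatives actually used are described in Supplement B, which is not reproduced here). All function names are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two univariate Gaussians."""
    return 0.5 * (math.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)


def distance(za, zb):
    """Sum over latent dimensions of a symmetrized KL (assumed stand-in for JS).

    za, zb: lists of (mean, variance) pairs, one pair per latent dimension.
    """
    return sum(
        0.5 * kl_gauss(ma, va, mb, vb) + 0.5 * kl_gauss(mb, vb, ma, va)
        for (ma, va), (mb, vb) in zip(za, zb)
    )


def triplet_probability(zi, zj, zl):
    """Probability that i is judged closer to j than to l.

    Equals exp(-D_ij) / (exp(-D_ij) + exp(-D_il)), rewritten as a sigmoid.
    """
    return sigmoid(distance(zi, zl) - distance(zi, zj))


zi = [(0.0, 1.0), (0.0, 1.0)]
zj = [(0.1, 1.0), (0.0, 1.0)]   # close to zi
zl = [(3.0, 1.0), (0.0, 1.0)]   # far from zi
print(triplet_probability(zi, zj, zl))
```

Swapping the roles of `zj` and `zl` yields the complementary probability, which mirrors the two Bernoulli states True and False.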

A subtlety of the acquisition process for the triplets is that oracles are not asked to provide distances,
but just binary statements over similarity rankings based on the prompted question. Thus, any triplet
which fulfills the statement made by the definition is valid. We model this relaxation as a truncated
Bernoulli likelihood, thresholded at 0.5, which we refer to as TBER later.

2.2 Masked Oracle Models 


Frequently, oracle information can be conflicting, especially when using multiple oracles. For
instance, consider colored geometric shapes, where we have a red triangle, a red circle and a blue
circle. If we ask one oracle to compare shapes with a set of triplets and we furthermore ask another
oracle to compare colors, we may get conflicting oracle constraints. The circles are more similar shapes
while the red circle and triangle have more similar colors. The generated oracle constraints cannot
easily be jointly fulfilled with uniform global constraints. We extend the presented model of oracle
observations by incorporating masks over the dimensions of the latent space, which weigh/select
dimensions on which the oracle constraints must hold. Since latent variables typically are of
higher dimension than 1, this approach aims at learning semantic latent subspaces where
specific variables encode features which correspond to semantic information associated
with an oracle. These subspaces can in general be entirely private to an oracle question,
or may share information with multiple questions. This leads to a compartmentalization of the
semantic representation. In Figure 2 we give another conceptual illustration and example that we
also use in Results.

Figure 2: Conceptually, observing oracle information from multiple oracles/questions allows an
otherwise fully unsupervised and unstructured model (left) to identify a semantically
compartmentalized generative process by using masked subspaces (right). In Results we model images
of illuminated faces and show that by using oracle observations the latent space factorizes
automatically into subspaces related to different semantic aspects.

Formally, for a set of $H$-dimensional latent variables $z$ we define a corresponding $H$-dimensional
global mask variable $m^Q$ which is shared between all samples and is specific to a question/oracle
$Q$. Using these masks, we adapt (4) to yield the masked oracle model:


$$D^{m}_{a,b} = \sum_{h=1}^{H} m_h\, \mathrm{JS}\big(p(z_a^h) \,\|\, p(z_b^h)\big) \tag{5}$$


We define learned masks by $m_h = \sigma(b_h)$, where $b_h \sim \mathcal{N}(0, 1)$ and $\sigma$ denotes the sigmoid function.
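To make the effect of the masks concrete, the following sketch applies sigmoid-transformed mask variables to per-dimension divergences, using the shapes-and-colors example from the text. The two-dimensional latent space (one shape dimension, one color dimension), the divergence values and the mask logits are invented for illustration, and the function names are hypothetical.

```python
import math


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def masked_distance(per_dim_divergence, b):
    """Weighted sum of per-dimension divergences with masks m_h = sigmoid(b_h)."""
    masks = [sigmoid(b_h) for b_h in b]
    return sum(m * d for m, d in zip(masks, per_dim_divergence))


# Illustrative divergences between a red circle and a blue circle:
# dimension 0 encodes shape (identical), dimension 1 encodes color (different).
red_circle_vs_blue_circle = [0.0, 4.0]

# Hypothetical mask logits: the shape oracle keeps dimension 0, the color oracle dimension 1.
b_shape = [6.0, -6.0]
b_color = [-6.0, 6.0]

d_shape = masked_distance(red_circle_vs_blue_circle, b_shape)
d_color = masked_distance(red_circle_vs_blue_circle, b_color)
print(d_shape, d_color)
```

Under the shape mask the two circles are nearly identical, while under the color mask they are far apart, so both oracles can be satisfied by the same latent codes.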


2.3 Variational Belief Networks

Apart from modeling oracle triplets defined over latent representations $z \in \mathbb{R}^H$, we are interested
in modeling observations $x \in \mathbb{R}^D$ well. We learn a graphical model to maximize $p(x)$ using
$H$-dimensional latent variables $z$. The latent variables can be drawn from any exponential family
distribution $p(z)$, but simplifying cases for inference and learning exist for many continuous
distributions. We can write the model as follows:

$$p(x; \theta) = \int p_\theta(x \mid z)\, p(z)\, dz, \tag{6}$$

with $p_\theta(x \mid z) := f(x; z, \theta)$ being an exponential family likelihood with parameters given as a
function of $z$. Also, $f$ is a function parametrized by $\theta$ (for instance, a multi-layer perceptron). Here, we
focus on the Gaussian distribution as a prior $p(z)$ with dimensions $H$, but note that with small
adaptations to the inference procedure other distributions are feasible. In this case we predict $D$ means
$\mu_z$ and variances $\sigma_z^2$ for each dimension $d$ given the state of latent variables $z$ using
$\mu_{z,d} = f_{\theta,\mu,d}(z)$ and $\log \sigma_{z,d} = f_{\theta,\sigma,d}(z)$.


A good estimator for learning the parameters of such a model by assuming an approximate conditional
posterior $q_\phi(z \mid x)$ was suggested in Kingma & Welling (2013); Rezende et al. (2014); Mnih & Gregor (2014).
The estimator forms a variational lower bound (Wainwright & Jordan, 2008; Jordan et al., 1999) to
the marginal likelihood. Performing coordinate ascent with respect to variational parameters $\theta$ and $\phi$
corresponds to minimizing the divergence between the true and the approximate posterior:

$$\log p_\theta(x^{(n)}) \geq \mathcal{L}(\theta, \phi; x^{(n)}) = -\mathrm{KL}\big(q_\phi(z) \,\|\, p(z)\big) + \mathbb{E}_{q_\phi(z)}\big[\log p_\theta(x^{(n)} \mid z)\big]. \tag{7}$$

In the following, we will refer to this fully unsupervised model as VAE. 
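The two terms of the bound (7) can be sketched for the Gaussian case as follows. This is a minimal single-sample estimate under stated assumptions: a diagonal Gaussian approximate posterior, a standard normal prior (for which the KL term has a well-known closed form), and a diagonal Gaussian likelihood. The callables `encode` and `decode` are hypothetical placeholders for the inference and generative networks and do not reflect the authors' architecture.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, mean, log_var):
    """log N(x | mean, diag(exp(log_var)))."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv))
        for xi, m, lv in zip(x, mean, log_var)
    )


def elbo_single_sample(x, encode, decode, rng):
    """One-sample estimate of the bound: reconstruction term minus KL term."""
    mu, log_var = encode(x)
    # Draw z from q(z | x) as mean plus scaled unit-Gaussian noise.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    x_mean, x_log_var = decode(z)
    return gaussian_log_likelihood(x, x_mean, x_log_var) - kl_to_standard_normal(mu, log_var)


# Toy placeholders: the posterior equals the prior and the decoder ignores z.
encode = lambda x: ([0.0], [0.0])
decode = lambda z: ([0.0, 0.0], [0.0, 0.0])
print(elbo_single_sample([0.0, 0.0], encode, decode, random.Random(0)))
```

In this toy setting the KL term vanishes and the bound reduces to the log-density of a two-dimensional standard normal at the origin.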


2.4 Joint model of triplets and observations: Oracle-Prioritized Belief Network


After having established an observation model for triplets t and an observation model for x, we can 
proceed to introduce the full generative process for a joint model over both observables. Instead of 
relying on a supervised model taking observations x as input, we prefer using a generative approach 
that models the joint density of both data and triplets to provide an unsupervised model which 
requires only input data and samples from an oracle to be trained. 

The advantage is that the generative latent variable encoding does not throw away information
about the observations, leading to models which explicitly need to capture latent factors
generating the data. When observing oracle samples, learning leads to explicit factors of the
information from both the oracle and the observations.

We use a belief network as introduced in Section 2.3 to model observations x and connect the
latent variables with an oracle observation term for triplets t introduced in Section 2.1. Triplets
require multiple samples from the prior to be drawn, as they are defined over multiple objects
jointly. Similar to the inference model in the Siamese network (Chopra et al., 2005), this
necessitates multiple instances of the model with shared parameters to work in coordination to
generate a triplet. We sketch the generative model in Figure 4.

Figure 4: Shown is the proposed joint model. It models observations x and triplets from an oracle t
as observed (shaded) variables. The latent space with variables z causes the shaded variables and
thus captures the information necessary for modeling both.

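For our proposed joint model, which we will refer to as OPBN, we consider $N$ datapoints and $K$
triplets defined over them:

$$p(x, t \mid \theta) = \int \prod_{n=1}^{N} \big[p(z_n)\, p_\theta(x_n \mid z_n)\big] \prod_{k=1}^{K} \big[p(t_k \mid z_{k_i}, z_{k_j}, z_{k_l})\big]\, dz \tag{8}$$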


The generative process according to this model is: 


1. sample $z_i, z_j, z_l$ from the prior, $z \sim \mathcal{N}(0, 1)$;

2. for each $z$, sample observation $x$ using a nonlinear likelihood, e.g. $p(x \mid z) = \mathcal{N}(x \mid \mu_z, \sigma_z^2)$;

3. for each set $\{i, j, l\}$, sample triplet $t_{i,j,l} \sim p(t \mid z_i, z_j, z_l)$.
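The three sampling steps can be sketched as ancestral sampling. The decoder below is a hypothetical toy stand-in for the nonlinear likelihood network, and the triplet is sampled from a softmax over negative squared Euclidean distances between the sampled codes, which is an assumed simplification of the triplet likelihood rather than the form used in the paper.

```python
import math
import random


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def sample_prior(dim, rng):
    """Step 1: draw a latent code from the unit Gaussian prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_decoder(z):
    """Hypothetical stand-in for the likelihood network: returns (means, stds)."""
    s = math.tanh(sum(z))
    return [s * (d + 1) for d in range(3)], [0.1] * 3


def sample_observation(z, decoder, rng):
    """Step 2: draw an observation from the Gaussian likelihood given z."""
    mean, std = decoder(z)
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def sample_triplet(zi, zj, zl, rng):
    """Step 3: draw the Bernoulli state of the triplet (i, j, l)."""
    d_ij = sum((a - b) ** 2 for a, b in zip(zi, zj))
    d_il = sum((a - b) ** 2 for a, b in zip(zi, zl))
    return rng.random() < sigmoid(d_il - d_ij)


rng = random.Random(0)
zi, zj, zl = (sample_prior(2, rng) for _ in range(3))
xs = [sample_observation(z, toy_decoder, rng) for z in (zi, zj, zl)]
t = sample_triplet(zi, zj, zl, rng)
print(xs, t)
```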


Triplets tie together multiple datapoints and capture their dependencies through the latent
representations. This has the effect of attaching higher-order potentials to the latent space, which the
model uses for regularization and guidance. It is noteworthy that learning consists of maximizing
the marginal likelihood $p(x, t)$ by integrating out the latent $z$'s. This directly maximizes the
evidence coming from the oracle and the observations, while maintaining flexibility for the model used
in-between. This model balances a reconstruction cost for the datapoints, the generative cost for the
triplets and the prior on the latent variables when generating samples.

2.5 Learning using fast Variational Inference

Our goal is to maximize the marginal likelihood of the evidence, $\log p_\theta(x, t)$, in order to learn a good
mapping capturing the dependencies between observations x and triplets t. This involves integrating
out the latent variables, which is in general analytically intractable in highly flexible model classes. In
order to perform efficient learning and inference in the model given by (8) we resort to approximate
inference methods and employ doubly stochastic Variational Inference (Kingma & Welling, 2013;
Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014). Variational Inference (Jordan et al., 1999;
Wainwright & Jordan, 2008) requires approximate distributions $q(z)$ over the posterior of the latent
variables. We use amortized inference by employing an inference network to learn a conditional
variational distribution $q_\phi(z \mid x)$ with an MLP parametrized by $\phi$. The inference model predicts the
variational approximation to the posterior latent variables per input data point. The evidence lower
bound (ELBO) looks as follows:

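$$\begin{aligned}
\log p_\theta(x, t) &= \mathcal{L}(\theta, \phi; x, t) + \mathrm{KL}\big(q(z) \,\|\, p(z \mid x, t)\big) \geq \mathcal{L}(\theta, \phi; x, t) \\
&= -\sum_{n} \mathrm{KL}\big(q(z^{n}) \,\|\, p(z)\big) + \sum_{n} \mathbb{E}_{q(z)}\big[\log p_\theta(x^{n} \mid z^{n})\big] + \sum_{k} \mathbb{E}_{q(z)}\big[\log p(t^{k} \mid z^{k_{i,j,l}})\big]
\end{aligned}$$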


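where $k_{i,j,l}$ acts as an index on matrix $z$ selecting the corresponding datapoints. Theoretically,
performing coordinate ascent on this lower bound is sufficient to infer the parameters of the model.
To obtain a differentiable estimator, latent samples are reparametrized as
$z^{n,l} = \mu^{n} + \lambda^{n} \odot \epsilon^{l}$,
where $\{\mu^{n}, \lambda^{n}\}$ are predicted variational parameters using the inference network and $\epsilon^{l} \sim \mathcal{N}(0, 1)$
are unbiased samples from a unit Gaussian. The differentiable new bound $\mathcal{L}(\theta, \phi; x, t)$ then takes
the shape:

$$-\sum_{n} \mathrm{KL}\big(q_\phi(z^{n} \mid x^{n}) \,\|\, p_\theta(z)\big) + \sum_{n} \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{n} \mid z^{n,l}) + \sum_{k} \frac{1}{L} \sum_{l=1}^{L} \log p(t^{k} \mid z^{k_{i,j,l},\, l})$$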


On this new objective we can now perform gradient-based learning by following $\nabla \mathcal{L}(\theta, \phi; x, t)$ with
respect to global variational parameters $\phi, \theta$. We perform stochastic gradient descent by drawing
minibatches with $N_B$ datapoints and $K_B$ triplets each time.
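The triplet term of the sampled bound can be sketched as a Monte Carlo average over reparametrized latent codes. The sketch assumes diagonal Gaussian posteriors given as (mean, standard deviation) lists and uses a softmax over negative squared Euclidean distances between sampled codes as a simplified triplet likelihood. Both choices, and all function names, are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def reparametrize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), elementwise."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def log_triplet_likelihood(zi, zj, zl):
    """log p(t = True | z_i, z_j, z_l) for a softmax over negative squared distances."""
    d_ij = sum((a - b) ** 2 for a, b in zip(zi, zj))
    d_il = sum((a - b) ** 2 for a, b in zip(zi, zl))
    # log(exp(-d_ij) / (exp(-d_ij) + exp(-d_il))) = -softplus(d_ij - d_il)
    a = d_ij - d_il
    return -(max(a, 0.0) + math.log1p(math.exp(-abs(a))))


def triplet_term(q_i, q_j, q_l, num_samples, rng):
    """Average log-likelihood of one triplet over reparametrized samples."""
    total = 0.0
    for _ in range(num_samples):
        zi, zj, zl = (reparametrize(mu, sigma, rng) for mu, sigma in (q_i, q_j, q_l))
        total += log_triplet_likelihood(zi, zj, zl)
    return total / num_samples


rng = random.Random(0)
q_i = ([0.0, 0.0], [0.1, 0.1])
q_j = ([0.1, 0.0], [0.1, 0.1])   # posterior close to q_i
q_l = ([3.0, 3.0], [0.1, 0.1])   # posterior far from q_i
satisfied = triplet_term(q_i, q_j, q_l, 50, rng)
violated = triplet_term(q_i, q_l, q_j, 50, rng)
print(satisfied, violated)
```

Encodings that respect the triplet contribute a value near zero, whereas encodings that violate it are penalized heavily, which is how the triplets act on the inference network during learning. A full objective would add this term to the per-datapoint reconstruction and KL terms for a minibatch.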

When learning masks $m$ as in Section 2.2 we infer the posterior distribution of the masks given
observed data, $p(m \mid \mathcal{D})$. We use variational inference analogously to learn an approximate distribution
$q(m^{Q}; \sigma^{Q})$ by adding the KL loss for the masks to the ELBO while taking into account
the state of the mask variable in the triplet likelihood.

Upon close inspection we detect that the components $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ form a variational
autoencoder where the parameters have distilled the triplet information. This also clarifies where the
transfer of implicit information from the triplets to the learned parametric model happens. In simple
terms, the formulation of the model forces the inference network to learn encodings respecting the
triplets and the model $p_\theta(x \mid z)$ to learn decodings which account for that shared information.

3 Relationship to other work 

Similarity-based learning, for instance via crowdsourcing, has been tackled in various ways in the
community before. Notably, crowd-kernels are inferred and used for various vision tasks in Van
der Maaten & Weinberger (2012), which assumes a fixed Student-t structure to produce an
embedding using similarity constraints from a crowd, but does not learn an adaptive latent representation of
the input features. In Chechik et al. (2010), a metric respecting the particular distances in similarity
is learned. This differs from the case we are studying, as it assumes that specific distances or
similarities are observed, which is hard to ask of a weak oracle. In Tamuz et al. (2011), a probabilistic
treatment for triplets is introduced and an adaptive crowd kernel is learned without specific visual
features in mind. While we also adopt a probabilistic treatment of triplets, we will learn an adaptive
feature representation comparing images from the crowd as well.

Flexible nonlinear models have been employed in a variety of situations to learn representations for
data. A key result in relation to this work is the Siamese network (Chopra et al., 2005), which uses
discriminatively learned features and refines them using a loss attached to the encodings of multiply
winged networks over the compared images. A similar version was later also developed which just
uses the oracle triplets as supervision instead of refining a supervised version of the features, which
is a setting we also consider (Hadsell et al., 2006). Similar approaches have been used in Wang
et al. (2014); Schroff et al. (2015), where usage of supervised features with crowd-inferred
similarities boosts performance in face verification and more generic fine-grained visual categorization
tasks. The key difference to our work is two-fold: we focus on a probabilistic encoder-decoder
approach, where features are learned from images without labels and image information is not thrown
away. Feature learning is guided additionally by an oracle and we introduce a probabilistic
generative model which provides a joint model of all these components and their interactions, including
semantic masking. This forces our model to learn explicit latent factors which capture the
knowledge from the oracle rather than learning to be invariant to it and thus constitutes a harder and more
comprehensive task.

Bayesian generative models have been proposed before for crowd-sourcing tasks (Liu et al., 2012).
Our model differs in that we introduce latent variables generating the observations on which we
evaluate triplet constraints, which are the observation from the oracle. Our setup better facilitates implicit
knowledge transfer via posterior regularization. Generative models in representation learning have
recently made rapid progress using variational inference (Kingma & Welling, 2013; Rezende et al.,
2014; Mnih & Gregor, 2014). These techniques allow fast learning of directed graphical models
and have been a major stepping stone in combining deep learning with graphical models. We briefly
review variational autoencoders in Section 2.3. Notably, in Kingma et al. (2014) these approaches
are used to achieve state-of-the-art results in semi-supervised learning with explicit labels. We
identify that as a related setting to ours: using an oracle we can obtain weak implicit supervision in the
form of similarity constraints over a sparse subset of the data and generalize from that, while most
of the data is not subject to oracle constraints. Furthermore, we similarly can learn functionally
compact subspaces which have semantic roles during generation. In Cheung et al. (2014), deep
generative models are used with functional constraints on the latent space to increase specificity of
latent variables, which is a goal we share but tackle using the oracle information as a model-based
semantic regularizer. Disentangling information and structuring models semantically is a theme in
more recent work (Reed et al., 2014; Kulkarni et al., 2015). Constraints on latent variable
models in an otherwise unsupervised setting have also found early usage in the context of Gaussian
Processes (Lawrence & Quiñonero-Candela, 2006) using back-constraints.

Using side-knowledge as a regularizer for the posterior over latent variables has been explored in
other settings for simpler latent variable models in Ganchev et al. (2010) and we take inspiration
from that work. An interesting link also exists between our formulation of the triplet likelihood
using the Jensen-Shannon divergence and generative adversarial networks (Goodfellow et al., 2014).
The Bernoulli likelihood we employ using a softmax can conceptually be adapted to use a classifier
to match the framework from Goodfellow et al. (2014).


Finally, an intuitive connection also exists with Vapnik's privileged learning framework (Vapnik &
Vashist, 2009), where in a supervised setting improved classifiers can be learned if privileged
information in the form of additional features is present during training time. Borrowing terminology,
we consider the similarity constraints to be a sparse privilege conveyed by an oracle of unobserved
structure and aim at learning a student model which improves understanding of the data. Our
generative interpretation of this setting ultimately leads to our approach of learning a pseudo-causal
inverse model of the data guided by oracle information by modeling factors of variation, instead of
learning invariances as in Hadsell et al. (2006); Chopra et al. (2005).


4 Results 


The aim of this section is to illustrate the key properties of our proposed algorithms. We start by
describing the dataset and preliminaries in Section 4.1. In the first part (Section 4.2), we quantitatively
compare our methods against other baseline and state-of-the-art methods. In the second part
(Section 4.3), we illustrate how our model variant with masks factorizes the latent spaces into distinct
semantic units.


4.1 Preliminaries 

We use a relatively small dataset that is, however, well-suited to illustrate the features of the
algorithm and facilitates the interpretation of the factorized latent spaces: the Yale Faces dataset (Lee
et al., 2005). The version we used comprises 2,414 images from 38 individuals under different
light conditions. We split it into 300 test images and 2,114 training images. The images were taken
under controlled conditions using a lighting rig which allows for light sources to be varied in
specific ways. The azimuth and elevation of the light in relation to the depicted face were changed
with values between −130 and +130 degrees and between −40 and 90 degrees, respectively. The resulting
images have dramatic variability in appearance due to shading, apart from variability in identity of the
depicted person.

We proceeded with a series of oracle simulations. In particular, we simulate three different questions
posed to an oracle that is presented with random triplets of images, which we then use in the
evaluations below. The questions we used were the following:

1. Who has the most similar identity? ("Identity")

2. Where is the light condition most similar in terms of azimuth? ("Azimuth")

3. Where is the light condition most similar in terms of elevation? ("Elevation")


While the first question is similar to a typical classification setting, answering it accurately may
actually require the ability to understand light variation well. Questions 2 and 3 concern complex
qualities of the images related to visual physics. The Yale Faces dataset provides metadata for
all images and we can use this metadata to simulate a number of triplets for each oracle.
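A simulated oracle of this kind can be sketched as follows. The metadata layout (a list of dictionaries with `identity`, `azimuth` and `elevation` keys), the tie handling and the convention that a triplet (i, j, l) means "i is more similar to j than to l" are assumptions made for illustration. The paper does not specify these details here.

```python
import random


def dissimilarity(a, b, categorical):
    """Metadata dissimilarity: 0/1 for categories, absolute difference for angles."""
    if categorical:
        return 0.0 if a == b else 1.0
    return abs(a - b)


def simulate_triplets(metadata, key, num_triplets, rng, categorical=False):
    """Answer random triplet queries from metadata for one question (key)."""
    triplets = []
    for _ in range(100 * num_triplets):      # bounded number of attempts
        if len(triplets) == num_triplets:
            break
        i, j, l = rng.sample(range(len(metadata)), 3)
        d_ij = dissimilarity(metadata[i][key], metadata[j][key], categorical)
        d_il = dissimilarity(metadata[i][key], metadata[l][key], categorical)
        if d_ij == d_il:
            continue                          # tie: the oracle gives no ordering
        triplets.append((i, j, l) if d_ij < d_il else (i, l, j))
    return triplets


# Toy metadata for four images (values invented for illustration).
meta = [
    {"identity": 0, "azimuth": -130.0, "elevation": -40.0},
    {"identity": 0, "azimuth": -120.0, "elevation": 0.0},
    {"identity": 1, "azimuth": 10.0, "elevation": 45.0},
    {"identity": 1, "azimuth": 130.0, "elevation": 90.0},
]
azimuth_triplets = simulate_triplets(meta, "azimuth", 20, random.Random(0))
identity_triplets = simulate_triplets(meta, "identity", 10, random.Random(0), categorical=True)
print(azimuth_triplets[:3], identity_triplets[:3])
```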


4.2 Comparison with baseline and state-of-the-art methods 


For evaluation we aim at a more quantitative understanding of what our and other models are good
at and where they have limitations. For this we consider the metric learning network analogous
to Hadsell et al. (2006) and a purely unsupervised variational autoencoder (see Section 2.3) as
comparators. We refer to them as MetricL and VAE, respectively. VAE is independent of oracle
triplets in all experiments since it works entirely unsupervised.


We use the learned representations of each model to assess the quality with respect to different
evaluation measures. In particular, we use the representations to predict the identity of the face, the
azimuth degree and the elevation degree (we provide classification error for the identity and RMSD
for the degrees). The latter is to assess how well the models capture physical properties of the images.
All evaluations are done on held-out test data using a logistic regression model. In addition, we
measure how well each model is able to predict triplets on test data, i.e., predict whether triplet
(i, j, l) or (i, l, j) is true. We provide further information for model details and experimental setup
in Appendix A.
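Two of these evaluation measures can be sketched in a few lines. The decision rule for triplet prediction (compare Euclidean distances between latent codes) and the reporting of the error as a percentage are assumptions, since the exact protocol is deferred to Appendix A. The function names are hypothetical.

```python
import math


def triplet_error(codes, triplets):
    """Percentage of triplets (i, j, l) where code i is not closer to j than to l."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    wrong = sum(
        1 for i, j, l in triplets
        if sq_dist(codes[i], codes[j]) >= sq_dist(codes[i], codes[l])
    )
    return 100.0 * wrong / len(triplets)


def rmsd(predicted, target):
    """Root-mean-square deviation between predicted and true angles."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


codes = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]]
held_out = [(0, 1, 2), (0, 2, 1)]       # the second triplet contradicts the codes
print(triplet_error(codes, held_out))   # 50.0
print(rmsd([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```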

The results reveal that OPBN and its variants are on average the best-performing methods. MetricL
effectively learns a classifier in this setting. In Table 1a we see that the generative model competes
with the metric learning method in terms of classification when being informed about identity from
the oracle, while maintaining low error rates for tasks related to image physics. It outperforms
VAE on classification accuracy, at no loss to image understanding. We believe this shows that it
incorporates oracle knowledge to shape alternative latent spaces compared to VAE. In Table 1b we
test how well a model can incorporate more subtle oracle knowledge. We inform models only using
a light condition oracle (azimuth). As expected, the metric learning performance collapses on all
tasks except on the targeted oracle task. VAE maintains the same performance since it is agnostic
to oracle information. OPBN, on the other hand, maintains good performance on all tasks and
improves on the unsupervised VAE in predicting light conditions.


In a more complex setting, we also give the models oracle information from all available questions
jointly and test their performance. As we show in Table 2, OPBN is again the best-performing
method on average. The metric learning approach cannot incorporate the variability of the available
information usefully when using just a few triplets (data not shown), but in the setting we report
using 100,000 triplets it achieves good performance on classification and elevation prediction. The
main difference in performance between VAE and OPBN is in classification and the ability to
predict triplets, which correlates with our observation that only about 50% of training triplets would
be satisfied with the VAE approach. We see that OPBN learns an equally competitive, but clearly
different latent space than VAE and captures the semantics of the oracle better by being more
predictive on unseen triplets on test data. However, the benefits of the oracles when incorporating
multiple queries are underwhelming in comparison to single oracles. To address this issue, we
add an experiment using Masked-OPBN. We observe that masked OPBN exhibits greatly improved
quantitative performance on all tasks and yields representations which are more predictive of the
image properties, the class and the held-out triplets than the other models. To further test this
capacity after seeing that OPBN stalls in its improved performance, we add an extra experiment with 5
times the triplets for Masked OPBN. We observe that masking allows the model to continue
improving as more triplets are added. OPBN with masks thus shows the greatest promise to incorporate
heterogeneous information from oracles into its latent spaces.

(a) Model trained with the identity oracle.

Oracle/Method        MetricL   VAE     OPBN
Identity             9.0       18.2    9.0
Azimuth              36.3      20.4    20.6
Elevation            20.0      10.5    10.4
Triplet prediction   1.25      34.00   6.42

(b) Model trained with the azimuth oracle.

Oracle/Method        MetricL   VAE     OPBN
Identity             70.0      18.2    18.7
Azimuth              13.5      20.4    16.4
Elevation            18.0      10.5    10.2
Triplet prediction   3.4       34.00   6.60

Table 1: Comparison of Metric Learning Networks (MetricL), Variational Autoencoders (VAE) and our
proposed model without masks (OPBN). We train the model with 100,000 triplets from the identity oracle (left)
and the azimuth oracle (right). Best results are in bold face. Second best results italic. We observe that
OPBN predicts all properties reasonably well, while MetricL only for the task it is trained for. VAE works well
on predicting lighting conditions. For more details see main text.

Oracle/Method        MetricL   VAE     OPBN   OPBN Mask   OPBN Mask 500k
Identity             10.2      18.2    9.7    9.0         7.3
Azimuth              27.1      20.4    19.8   13.8        14.1
Elevation            9.1       10.5    10.0   6.5         6.0
Triplet prediction   27.?*     34.0*   5.7    5.2         4.2

Table 2: Comparison of Metric learning networks (MetricL), Variational Autoencoders (VAE) and three variants
of our model (OPBN in Figure 1, OPBN with masks, and OPBN with masks and 500,000 triplets instead
of 100,000 triplets per oracle). We use triplets from the three oracles Identity, Azimuth and Elevation. We
observe that OPBN with masks performs much better on all evaluations. Using more triplets leads to further
improvements. Numbers marked with * should be considered with care since MetricL and VAE are not aware
of differences between oracles. See main text for more details.


4.3 Masking OPBN Leads To Factorization of Latent Spaces


We observed that the masked version of OPBN shows greatly improved ability to learn from complex
oracles with multiple heterogeneous queries compared to other discussed approaches (see Figure ^).
For a model with otherwise equal parametric capacity and the same fundamental inference machinery,
this constitutes a surprising observation. In this section, we illustrate the effects of the learned
masks and how they contribute to performance improvements.

In the non-masked models, triplet likelihoods are global. By learning local likelihoods via masking,
or subspaces, we allow the model to decide which parts of the space it uses per query. Variational
compression leads to solutions that use the smallest possible number of variables. In Figure 6 we
show a fully learned mask for the Yale Faces model for each query. Learning jointly from all queries
leads to a factorization of the latent space into task-specific and partially shared latent variables.
We also observed a strong quantitative footprint from usage of these masks: performance on all
predictive tasks for models of otherwise equal capacity improves across the board (see Table 2),
leading to models which jointly capture light conditions and class better. We consider this an effect
of knowledge transfer from the oracle/crowd, allowing the model to identify semantic latent variable
systems which do not only strive for high likelihoods on pixels, but also help model oracle triplets.
We also observe that models with masks improve dramatically with availability of more triplets. The
Bayesian objective helps compress the latent spaces into semantic variables.

Figure 6 (panels: Identity subspace, Azimuth subspace, Elevation subspace; horizontal axis: latent
dimensions): We show the oracle-specific masks over the latent space when learning from multiple
oracles at once. It is evident that the model learns to virtually switch off different dimensions per
question and learns to factorize the latent space into compartmentalized, task-relevant subspaces
without ever explicitly receiving supervision, for instance, about how to factorize light from a face
class.


In order to inspect the latent spaces induced by the masks, we embed the respective subspaces of the
latent variable encoding using t-SNE in Figure 7. It reveals that we learn fine-grained class clusters
when using the identity subspace, and continuous and smooth embeddings for azimuth and elevation,
indicating that the model captures the continuous nature of light placement in the images. These
results suggest that the oracle-informed model is able to learn the semantics of light placement and
of facial structure in dedicated subspaces, making the learned space significantly more semantically
interpretable. This helps identify semantic variables which were never explicitly observed, as
sketched earlier in Figure ^. We finally present an example of using masks to sample synthesized
images in Figure 8. This illustrates the controlled transfer of subtle imaging-physics properties
from one image to the next using the model.
Figure 7: t-SNE visualizations of the latent subspaces identified by the model, using the mask weights
shown in Figure 6. For visualization, we only use the dimensions with weight greater than 0.2 for each
oracle. We observe that the identity subspace clearly separates faces of different persons (left),
while the azimuth (middle) and elevation (right) subspaces organize the images by the respective angle
of light exposure.
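
The subspace selection used for this visualization can be sketched as follows. This is a minimal
illustration, not the code used in the experiments: the arrays `z` and `masks`, the mask values and
the helper `select_subspace` are hypothetical, and we assume mask weights lie in [0, 1]. Only the
threshold of 0.2 is taken from the caption.

```python
import numpy as np

def select_subspace(z, mask, threshold=0.2):
    """Keep only the latent dimensions whose mask weight exceeds `threshold`.

    z    : (n_samples, n_latent) latent encodings (assumed layout)
    mask : (n_latent,) learned per-oracle weights, assumed to lie in [0, 1]
    """
    active = mask > threshold            # boolean selector over latent dimensions
    return z[:, active], np.flatnonzero(active)

# Toy example with 6 latent dimensions and made-up mask weights.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 6))
masks = {
    "identity":  np.array([0.9, 0.8, 0.1, 0.0, 0.05, 0.3]),
    "azimuth":   np.array([0.0, 0.1, 0.7, 0.9, 0.0, 0.0]),
    "elevation": np.array([0.0, 0.0, 0.0, 0.25, 0.95, 0.1]),
}
subspaces = {name: select_subspace(z, m) for name, m in masks.items()}
```

Each selected matrix could then be passed to any t-SNE implementation to obtain a two-dimensional
embedding per oracle.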

Figure 8 (panels: "Synthetic: Face B Light A" and "Original: Face B Light A"): We illustrate the
factorization of latent spaces when using masked models. We take two images from the training set,
A and B, and project them into latent space. We use the encoding of the face given by the identity
mask applied to image B and combine it with the latent features given by the azimuth mask applied to
the encoding of image A. The resulting image is a blend of the two, as expected, and approximates the
facial features of image B, especially the mouth region and facial shape. The blended image
furthermore exhibits light properties similar to image A. We finally show an unobserved test image
which shows face B under light conditions A for comparison. We see that the facial transfer is not
perfect, as the eyebrows are still taken from image A and the skin shading is a blend of both images.
These are understandable mistakes, since eyebrows are frequently shaded or confused with light
conditions in this dataset. We expect this to improve on bigger datasets and when using more oracle
samples.
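
The recombination described in the caption of Figure 8 can be sketched as below. This is an
illustrative sketch under assumptions: the encodings and masks are made-up arrays, `blend_latents`
is a hypothetical helper, and the rule for dimensions claimed by both masks (a mask-weighted
average) is our choice, since the text does not specify it. Decoding the blended code into an image
requires the trained generative model and is omitted.

```python
import numpy as np

def blend_latents(z_a, z_b, mask_light, mask_identity, eps=1e-8):
    """Combine the identity features of B with the light features of A.

    Each latent dimension is a mask-weighted average of the two encodings:
    dimensions used only by the identity mask come from B, dimensions used
    only by the light mask come from A. Dimensions used by neither mask
    fall back to A (an arbitrary choice made for this sketch).
    """
    w_a = np.asarray(mask_light, dtype=float)
    w_b = np.asarray(mask_identity, dtype=float)
    total = w_a + w_b
    unused = total < eps
    w_a = np.where(unused, 1.0, w_a)      # fallback: take unused dimensions from A
    total = np.where(unused, 1.0, total)
    return (w_a * z_a + w_b * z_b) / total

z_a = np.array([1.0, 1.0, 1.0, 1.0])        # encoding of image A (toy values)
z_b = np.array([-1.0, -1.0, -1.0, -1.0])    # encoding of image B (toy values)
mask_identity = np.array([1.0, 0.0, 0.5, 0.0])
mask_azimuth = np.array([0.0, 1.0, 0.5, 0.0])
z_mix = blend_latents(z_a, z_b, mask_azimuth, mask_identity)
```

In this toy example the first dimension is taken from B, the second from A, the third is an equal
mix, and the fourth falls back to A.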



5 Discussion 

We have introduced a joint unsupervised generative model over observations and triplet constraints
as given by an oracle. Our first contribution is a fully probabilistic treatment of triplets and
latent variable models in a joint unsupervised setting using variational belief networks. We show how
this joint learning allows implicit knowledge from an oracle, such as a human crowd, to be transferred
to a rich parametric model, resulting in improved classification scores, improved ability to predict
triplets and more interpretability of the crowd biases. This can be a useful framework to encode
expert knowledge in probabilistic reasoning systems when the exact model is unknown or labels are
hard to obtain. Second, we introduce information-theoretic distance measures for triplets,
generalizing the commonly used Euclidean distances. We furthermore introduce the notion of
question-specific masks in latent space to force the model to identify interpretable features of
relevance for each specific type of oracle constraint, enabling the model to learn from multiple
types of questions at once and boosting performance further.

Our approach using variational inference and a triplet likelihood is not limited to belief networks,
so it will be interesting to use the framework in conjunction with other flexible probabilistic
models such as Gaussian processes and infinite partition models. We highlight that our framework
needs no supervised pre-training of features, as it can learn problem-specific nonlinear feature
spaces adapted to the available information. We showed that our approach compares favorably with
state-of-the-art metric learning models and fully unsupervised methods in a generic application using
feedforward networks. Our model can be readily extended with convolutional and de-convolutional
networks for use on high-dimensional data. It will be interesting to combine the learning approach
with more structure in temporally or spatially constrained models and to encode other relationships
like topological or unobserved constraints, such as the taste of food in images. On the oracle side,
future work on more accurate crowd modeling for different bias and noise regimes is promising in
conjunction with use cases such as Amazon Mechanical Turk. Our model is also particularly amenable
to active learning for probing the oracles optimally. Finally, we wish to mention the potential for
this framework to assist perceptual applications where biases of the human visual system can be
studied with the assistance of generative models.
A Appendix A: Experimental Settings


Here we give further experimental details.

In all experiments we used diagonal Normal distributions as priors for the latent space and rmsProp
with momentum (Graves, 2013) or ADAM (Kingma & Ba, 2014) as the optimizer. All experiments
were run on Graphics Processing Units (GPUs) using a Theano (Bergstra et al., 2010) implementation
and did not take more than a few hours each.


We simulate an oracle for each question by using the annotations provided with the dataset. For
question one, we sample from the label distribution and check for a match to produce answers to the
generated triplets. For questions two and three, we sample from the relative distances of target
angles to given angles to produce the triplet information. We generate 3 different simulated oracles,
OracleID, OracleAz and OracleAll, which correspond to asking just the first question, just the second,
or all three mixed. We sample 100,000 triplets for each oracle question at random (meaning that the
third simulation has 300,000 triplets). We repeat the process 3 times to account for sampling bias and
report the means over the reruns in our experiments. In an extra illustrative example combining all
three oracles, we sample 500,000 triplets per question for a total of 1,500,000 triplets to inform
the model. While these numbers sound high, we note that there is a large combinatorial space of
possible triplets to be explored.
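
A simulated oracle of this kind can be sketched as follows. This is a minimal illustration under
assumptions, not the procedure used in the experiments: the function names, the triplet convention
(anchor `a` is judged closer to `b` than to `c`), the deterministic answers and the rejection of
uninformative draws are ours. The text only states that answers are derived from label matches and
from relative angle distances.

```python
import numpy as np

def identity_triplets(labels, n_triplets, rng):
    """Sample triplets (a, b, c) where a and b share a label and c does not."""
    labels = np.asarray(labels)
    triplets = []
    while len(triplets) < n_triplets:
        a, b, c = rng.choice(len(labels), size=3, replace=False)
        if labels[a] == labels[b] and labels[a] != labels[c]:
            triplets.append((a, b, c))
    return np.array(triplets)

def angle_triplets(angles, n_triplets, rng):
    """Sample triplets (a, b, c) where b's angle is closer to a's than c's is."""
    angles = np.asarray(angles, dtype=float)
    triplets = []
    while len(triplets) < n_triplets:
        a, b, c = rng.choice(len(angles), size=3, replace=False)
        d_ab = abs(angles[a] - angles[b])
        d_ac = abs(angles[a] - angles[c])
        if d_ab == d_ac:
            continue                      # no preference, skip this draw
        triplets.append((a, b, c) if d_ab < d_ac else (a, c, b))
    return np.array(triplets)

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)            # toy identity annotations
azimuth = rng.uniform(-90.0, 90.0, size=200)     # toy light angles in degrees
t_id = identity_triplets(labels, 1000, rng)
t_az = angle_triplets(azimuth, 1000, rng)
```

A mixed oracle in the spirit of OracleAll could be obtained by concatenating the triplet sets of the
individual questions together with a question index.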

We proceed to learn fully unsupervised models of these images using an architecture with 200 hidden
deterministic units and 50 latent variables. The deterministic layers use tanh nonlinearities.
We set up the analogous MetricL model without a generative path as a supervised learning model
optimizing the triplet embeddings given images with a Euclidean loss function.


B Appendix B: Details for Oracle-Likelihood 


In order to compute the triplet likelihood, we need to calculate an expensive divergence term D
using an information-theoretic quantity, the Jensen-Shannon (JS) divergence as defined in Section 2.1.
In practice, this term is typically intractable analytically since it involves a KL divergence to a
mixture of two possibly disjoint distributions. In order to evaluate this KL divergence, exhaustive
sampling methods need to be used.

In order to avoid expensive sampling steps during training, we explore approximations to the term
D. In the presented experiments we used:


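$$D_{i,j} = \sum_{h=1}^{H} \left[ \frac{1}{2}\,\mathrm{KL}\big(p(z_i^h)\,\|\,p(z_j^h)\big) + \frac{1}{2}\,\mathrm{KL}\big(p(z_j^h)\,\|\,p(z_i^h)\big) \right]$$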


This approximation is inaccurate globally, but empirically it is fast, yields better results than the
KL divergence or a Euclidean distance, and becomes accurate in the limit of nearby distributions.
Using the full JS divergence is beneficial to the model and yields stronger posterior regularization,
allowing it to learn more efficiently from triplets, especially in combination with full-covariance
latent spaces. An overview of previous approximations related to the JS divergence is given in
(Hershey et al., 2007). We have tried previously known Monte Carlo-based approximations and are
exploring novel deterministic approximations to this term, and expect to show empirical performance
in an update to this paper and in follow-up work.
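
Assuming the approximation above is the symmetrized KL divergence summed over latent dimensions (our
reading of the equation), it has a closed form for diagonal Gaussian latent distributions. The sketch
below is an illustration under that assumption rather than the implementation used in the
experiments: each image is assumed to be encoded as a diagonal Gaussian with mean `mu` and standard
deviation `sigma`, and the function names are hypothetical.

```python
import numpy as np

def kl_diag_gauss(mu_p, sigma_p, mu_q, sigma_q):
    """Per-dimension KL(p || q) between univariate Gaussians (closed form)."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

def symmetrized_kl(mu_i, sigma_i, mu_j, sigma_j):
    """Sum over latent dimensions of 0.5*KL(p_i || p_j) + 0.5*KL(p_j || p_i)."""
    forward = kl_diag_gauss(mu_i, sigma_i, mu_j, sigma_j)
    backward = kl_diag_gauss(mu_j, sigma_j, mu_i, sigma_i)
    return float(np.sum(0.5 * forward + 0.5 * backward))

# Toy encodings with two latent dimensions.
mu_i, sigma_i = np.array([0.0, 1.0]), np.array([1.0, 0.5])
mu_j, sigma_j = np.array([0.5, 1.0]), np.array([1.0, 2.0])
d_ij = symmetrized_kl(mu_i, sigma_i, mu_j, sigma_j)
d_ji = symmetrized_kl(mu_j, sigma_j, mu_i, sigma_i)
d_ii = symmetrized_kl(mu_i, sigma_i, mu_i, sigma_i)
```

The quantity is symmetric in its two arguments and vanishes for identical distributions, which is
what makes it usable as a distance-like term inside a triplet likelihood.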


C Appendix C: Further Yale Faces Samples 

We trained a model on Yale Faces with 400 hidden units (the number of units was chosen at the point
where likelihoods stopped improving) and used it similarly to the masked experiments in the main
paper. We use the space in the supplement to show a few more samples in Figure 9 and perform a form
of image algebra by adding components of various images together.
Figure 9 (row labels: Face, Light): Top: We select a training image of a face and select the latent
encoding corresponding to its identity. Middle: We select three training images of another face under
different light conditions and select the light variables according to the mask. Bottom: We synthesize
new images with the face and light variables clamped to the observations and see that noisy faces are
generated which look like the top face and have light conditions like the middle ones.


D Appendix D: Yale Faces Triplet Variations 

We sample another batch of 100,000 triplets per query for the Yale dataset and rerun OPBN-Masked
with a varying number of triplets to clarify the effect. We show results in Table 3, where it is
evident that all metrics improve overall as we add triplets. We note that these numbers are based on
a single sampling of triplets and thus are subject to sampling noise: by chance, the set may contain
more or fewer informative triplets.


Table 3: Comparison of model metrics on Yale Faces with varying numbers of triplets.

OracleAll          | 100   | 1000  | 10000 | 100000
Triplet Prediction | 35.95 | 28.88 | 22.95 | 5.56
Classification     | 19.00 | 15.66 | 12.66 | 8.66
Azimuth RMSD       | 19.59 | 18.73 | 17.02 | 15.50
Elevation RMSD     | 9.59  | 10.99 | 7.75  | 6.37
Figure 10: We show the generated training data. Left: the original image. Right: progressively rotated
versions of the original, depending on position on the trajectory.


E Appendix E: MNIST Experiments 

We generated a perturbed version of the MNIST dataset in order to show other settings in which the
proposed approach can be useful. In normal MNIST, each digit is generated depending on its class.
Style variations are visible by eye, but they are not captured in the metadata and so cannot be used
for evaluation. We take 10,000 MNIST digits in equal proportions from each class and rotate them by
5 progressively increasing positive angles. This creates the effect of pushing the digits to fall
over towards the right side in a trajectory, as shown in Figure 10. The questions we can ask simulated
oracles here are the following:

1. Which image of a set is part of the same trajectory? This question is related to both identity 
and instantiation (style) of a variable. 

2. Which images have similar/dissimilar angles/timepoints? 

3. Which images have similar labels? 
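
The rotation trajectories on which these questions are asked can be sketched as follows. This is an
illustration under assumptions: the concrete angles, the clockwise direction, the nearest-neighbour
resampling and the toy input image are ours, since the text only specifies 5 progressively increasing
positive angles. The helpers `rotate_image` and `make_trajectory` are hypothetical.

```python
import numpy as np

def rotate_image(img, angle_deg):
    """Rotate a 2-D image about its centre by `angle_deg` degrees
    (clockwise as displayed, nearest-neighbour sampling, zero fill)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    # Inverse mapping: for each output pixel, locate the source pixel.
    src_x = np.cos(theta) * dx + np.sin(theta) * dy + cx
    src_y = -np.sin(theta) * dx + np.cos(theta) * dy + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def make_trajectory(img, angles_deg):
    """Return the original image followed by its rotated versions."""
    return np.stack([img] + [rotate_image(img, a) for a in angles_deg])

angles = [10, 20, 30, 40, 50]       # hypothetical values, not taken from the text
digit = np.zeros((28, 28))
digit[4:24, 13:15] = 1.0            # toy vertical stroke standing in for an MNIST digit
trajectory = make_trajectory(digit, angles)
```

Each trajectory carries three annotations that a simulated oracle can use: the trajectory index, the
position along the trajectory (angle), and the digit label.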

We chose not to include order in the labels, but similarity between digit images could also be
defined by the value of a digit, which could be used for reasoning tasks such as performing
mathematical operations with inferred values in an ordered manifold. In our setting we assume that
similarity is implied by the same label and dissimilarity otherwise. In future work we plan to exploit
semantic oracles which understand order for reasoning-related tasks. In our experiment, using 2
deterministic layers with 800 and 400 hidden units with tanh nonlinearities and 50 latent variables,
we find masked OPBN to outperform VAE, see Table 4. The mask variables also manage to factorize the
latent space sharply into variables for each query in the setting of 100,000 triplets. We also
observed that when using fewer triplets the second query suffered the most in performance, which
makes intuitive sense since learning a rotation is a harder task than learning to match labels. In
terms of final performance we observe that OPBN can strongly reduce the predictive errors for the
tasks, although the two synthetic tasks are actually quite hard.

Table 4: Comparison of model metrics on MNIST between VAE and masked OPBN with 100, 1,000 and 100,000
triplets.

OracleAll           | VAE   | OPBNmasked-100 | OPBNmasked-1000 | OPBNmasked-100000
Triplet Prediction  | 36.51 | 34.35          | 34.32           | 16.45
Classification      | 24.19 | 23.05          | 22.61           | 10.68
Rotation Angle RMSD | 18.87 | 18.4           | 18.7            | 14.28
Table 5: Comparison of model metrics on Yale Faces with Identity crowd.

OracleID ε = 0       | MetricL | OPBN
Azimuth RMSD         | 57      | 20.99
Elevation RMSD       | 24.3    | 10.13
Classification %     | 14      | 7.2
Triplet Prediction % | 5.0     | 5.3


Table 6: Comparison of model metrics on Yale Faces with Azimuth crowd.

OracleAZ ε = 0       | MetricL | OPBN
Triplet Prediction % | 10.8    | 18.1
Azimuth RMSD         | 18      | 19.8
Elevation RMSD       | 18.59   | 11.7
Classification %     | 40.7    | 7.9


F Appendix F: Robustness to Oracle-Noise 

Here we train Metric Learning and OPBN with 2,000 triplets and 2,000 datapoints in a variety of
oracle settings. Oracles are perturbed by noise ε, meaning that a fraction ε of the triplets is
flipped and thus wrong. The experiment illustrates the robustness that the generative aspect of the
model provides, whereas metric learning approaches lose more performance since they cannot benefit
from modeling observations directly. We observe that OPBN learns significantly better representations
for predicting azimuth, elevation and the classification label. OPBN performs similarly well in terms
of triplet prediction.
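
The oracle perturbation can be sketched as follows. This is a minimal illustration under
assumptions: the triplet convention (anchor `a` is judged closer to `b` than to `c`), the toy
triplets and the helper `flip_triplets` are ours. Only the idea of flipping a fraction ε of the
triplets is taken from the text.

```python
import numpy as np

def flip_triplets(triplets, eps, rng):
    """Return a copy of `triplets` in which a fraction `eps` has a flipped answer.

    A triplet (a, b, c) reads "a is closer to b than to c"; flipping swaps
    b and c, which makes the stated answer wrong.
    """
    triplets = np.array(triplets, copy=True)
    n_flip = int(round(eps * len(triplets)))
    idx = rng.choice(len(triplets), size=n_flip, replace=False)
    triplets[idx] = triplets[idx][:, [0, 2, 1]]   # swap second and third entry
    return triplets, idx

rng = np.random.default_rng(0)
# Toy clean triplets: three distinct indices per row.
clean = np.stack([rng.permutation(50)[:3] for _ in range(2000)])
noisy, flipped_idx = flip_triplets(clean, eps=0.2, rng=rng)
```

With 2,000 triplets and ε = 0.2, exactly 400 triplets carry a wrong answer and the remaining 1,600
are unchanged.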


Table 7: Comparison of model metrics on Yale Faces with All-oracle.

OracleAll ε = 0      | VAE   | MetricL | OPBN  | OPBN-Masked
Triplet Prediction % | 38.7  | 33      | 18.4  | 9.7
Azimuth RMSD         | 21.18 | 40.42   | 21.77 | 16.37
Elevation RMSD       | 11.57 | 14.75   | 10.23 | 7.14
Classification %     | 17    | 25      | 12.6  | 13.6
Table 8: Comparison of model metrics on Yale Faces for noise robustness when using MetricL.

MetricL-All          | ε = 0 | ε = 0.2 | ε = 0.4
Triplet Prediction % | 67    | 59.9    | 52
Azimuth RMSD         | 40.42 | 35.32   | 41.42
Elevation RMSD       | 14.75 | 18.62   | 21.6
Classification %     | 75    | 71      | 71


Table 9: Comparison of model metrics on Yale Faces for noise robustness when using OPBN.

OPBN                 | ε = 0 | ε = 0.2 | ε = 0.4
Triplet Prediction % | 18.4  | 37.6    | 46.2
Azimuth RMSD         | 21.77 | 22.9    | 23.75
Elevation RMSD       | 10.23 | 10.89   | 10.72
Classification %     | 92.6  | 89.3    | 89.1